Exploratory Data Analysis¶

Exploratory data analysis (EDA) is the first step in understanding a dataset: an initial examination of its structure, quality, and distributions.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from natsort import index_natsorted

Initially, we import the fundamental libraries needed for the EDA and for the machine-learning steps that follow.

In [3]:
data_icu = pd.read_csv('Kaggle_Sirio_Libanes_ICU_Prediction.csv')

We use pandas' read_csv to load the .csv file and store the dataset as data_icu.

In [4]:
data_icu.head()
Out[4]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
0 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0-2 0
1 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 2-4 0
2 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 4-6 0
3 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 ... -1.000000 -1.000000 NaN NaN NaN NaN -1.000000 -1.000000 6-12 0
4 0 1 60th 0 0.0 0.0 0.0 0.0 1.0 1.0 ... -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 ABOVE_12 1

5 rows × 231 columns

The first five rows of data_icu are displayed with the .head() method.

In [5]:
data_icu.tail()
Out[5]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
1920 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0-2 0
1921 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 2-4 0
1922 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 4-6 0
1923 384 0 50th 1 0.0 0.0 0.0 0.0 0.0 0.0 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 6-12 0
1924 384 0 50th 1 0.0 0.0 1.0 0.0 0.0 0.0 ... -0.547619 -0.838384 -0.701863 -0.585967 -0.763868 -0.612903 -0.551337 -0.835052 ABOVE_12 0

5 rows × 231 columns

The last five rows of data_icu are displayed with the .tail() method.

In [6]:
data_icu.shape
Out[6]:
(1925, 231)

We used .shape to check the dimensions of the dataset; it returns (rows, columns) = (1925, 231).

In [7]:
def agr_perc_to_int(percentil):
    # Map the AGE_PERCENTIL strings (e.g. "60th", "Above 90th") to integers.
    if percentil == "Above 90th":
        return 100
    else:
        return int("".join(c for c in str(percentil) if c.isdigit()))
In [8]:
data_icu["AGE_PERCENTIL"] = data_icu["AGE_PERCENTIL"].apply(agr_perc_to_int)
set(data_icu["AGE_PERCENTIL"].values)
Out[8]:
{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
In [9]:
def cat_window(window):
    # Map the WINDOW strings (e.g. "0-2", "ABOVE_12") to their upper bound.
    if window == "ABOVE_12":
        return 13
    else:
        return int(window.split("-")[1])

data_icu['WINDOW'] = data_icu['WINDOW'].apply(cat_window)
data_icu['WINDOW'].isnull().sum()
Out[9]:
0

Two columns were stored as objects (AGE_PERCENTIL and WINDOW). To avoid numerical issues later, the functions above convert these strings to integers.
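
As an aside, the same conversions can be written vectorized with pandas string methods. This is a hedged sketch on toy values (not run in this notebook); only the mappings mirror agr_perc_to_int and cat_window above.

```python
import pandas as pd

# Toy frame mimicking the two object columns (hypothetical values).
df = pd.DataFrame({
    "AGE_PERCENTIL": ["60th", "Above 90th", "10th"],
    "WINDOW": ["0-2", "ABOVE_12", "6-12"],
})

# Extract the digits; "Above 90th" maps to 100, as in agr_perc_to_int.
age = df["AGE_PERCENTIL"].str.extract(r"(\d+)")[0].astype(int)
df["AGE_PERCENTIL"] = age.where(df["AGE_PERCENTIL"] != "Above 90th", 100)

# Take the upper bound after the hyphen; "ABOVE_12" has no hyphen and
# falls through to 13, as in cat_window.
win = df["WINDOW"].str.extract(r"-(\d+)$")[0]
df["WINDOW"] = win.fillna(13).astype(int)

print(df)
```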

In [10]:
data_icu.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Columns: 231 entries, PATIENT_VISIT_IDENTIFIER to ICU
dtypes: float64(225), int64(6)
memory usage: 3.4 MB

Using info(), we can check each column's dtype; after the conversions, all 231 columns are numeric (225 float64 and 6 int64).

In [11]:
data_icu.describe()
Out[11]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
count 1925.000000 1925.000000 1925.000000 1925.000000 1920.000000 1920.000000 1920.000000 1920.000000 1920.000000 1920.000000 ... 1231.000000 1239.000000 1240.000000 1240.000000 1240.000000 1177.000000 1231.000000 1239.000000 1925.000000 1925.000000
mean 192.000000 0.467532 53.194805 0.368831 0.108333 0.028125 0.097917 0.019792 0.128125 0.046875 ... -0.770338 -0.887196 -0.786997 -0.715950 -0.817800 -0.719147 -0.771327 -0.886982 7.400000 0.267532
std 111.168431 0.499074 28.673479 0.482613 0.310882 0.165373 0.297279 0.139320 0.334316 0.211426 ... 0.319001 0.296147 0.324754 0.419103 0.270217 0.446600 0.317694 0.296772 4.364619 0.442787
min 0.000000 0.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 2.000000 0.000000
25% 96.000000 0.000000 30.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 4.000000 0.000000
50% 192.000000 0.000000 50.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... -0.976190 -0.979798 -1.000000 -0.984944 -0.989822 -1.000000 -0.975924 -0.980333 6.000000 0.000000
75% 288.000000 1.000000 80.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... -0.595238 -0.878788 -0.645482 -0.522176 -0.662529 -0.634409 -0.594677 -0.880155 12.000000 1.000000
max 384.000000 1.000000 100.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 13.000000 1.000000

8 rows × 231 columns

describe() displays summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column of the DataFrame.
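
For reference, describe() also accepts custom percentiles beyond the default 25/50/75%. A small sketch on toy data (hypothetical values, not the ICU dataset):

```python
import pandas as pd
import numpy as np

# Toy numeric frame: the integers 1..100.
df = pd.DataFrame({"x": np.arange(1, 101)})

# Request the 5th and 95th percentiles alongside the median.
summary = df.describe(percentiles=[0.05, 0.5, 0.95])
print(summary)
```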

In [12]:
data_icu.count()
Out[12]:
PATIENT_VISIT_IDENTIFIER      1925
AGE_ABOVE65                   1925
AGE_PERCENTIL                 1925
GENDER                        1925
DISEASE GROUPING 1            1920
                              ... 
RESPIRATORY_RATE_DIFF_REL     1177
TEMPERATURE_DIFF_REL          1231
OXYGEN_SATURATION_DIFF_REL    1239
WINDOW                        1925
ICU                           1925
Length: 231, dtype: int64

.count() returns the number of non-null values for each variable in data_icu, a first hint at where data is missing.

In [13]:
data_icu.nunique()
Out[13]:
PATIENT_VISIT_IDENTIFIER      385
AGE_ABOVE65                     2
AGE_PERCENTIL                  10
GENDER                          2
DISEASE GROUPING 1              2
                             ... 
RESPIRATORY_RATE_DIFF_REL     200
TEMPERATURE_DIFF_REL          457
OXYGEN_SATURATION_DIFF_REL    187
WINDOW                          5
ICU                             2
Length: 231, dtype: int64

The nunique() function returns the number of unique values for each column.
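
A common follow-up, sketched here on toy data (hypothetical column values): use nunique() to pick out the binary flag columns, which often deserve different treatment from continuous measurements.

```python
import pandas as pd

# Toy frame with two binary flags and one continuous column.
df = pd.DataFrame({
    "GENDER": [0, 1, 0, 1],
    "ICU": [0, 0, 1, 1],
    "HEART_RATE_MEAN": [72.0, 85.5, 90.1, 64.2],
})

# Columns with exactly two distinct values are binary flags.
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
print(binary_cols)
```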

In [14]:
data_icu.columns
Out[14]:
Index(['PATIENT_VISIT_IDENTIFIER', 'AGE_ABOVE65', 'AGE_PERCENTIL', 'GENDER',
       'DISEASE GROUPING 1', 'DISEASE GROUPING 2', 'DISEASE GROUPING 3',
       'DISEASE GROUPING 4', 'DISEASE GROUPING 5', 'DISEASE GROUPING 6',
       ...
       'TEMPERATURE_DIFF', 'OXYGEN_SATURATION_DIFF',
       'BLOODPRESSURE_DIASTOLIC_DIFF_REL', 'BLOODPRESSURE_SISTOLIC_DIFF_REL',
       'HEART_RATE_DIFF_REL', 'RESPIRATORY_RATE_DIFF_REL',
       'TEMPERATURE_DIFF_REL', 'OXYGEN_SATURATION_DIFF_REL', 'WINDOW', 'ICU'],
      dtype='object', length=231)

We inspected the dataset's column names with the .columns attribute.

Data Duplication¶

Data Quality Issues¶

In [15]:
data_icu.drop_duplicates().shape
Out[15]:
(1925, 231)
In [16]:
data_icu.duplicated(subset=None, keep="first")
Out[16]:
0       False
1       False
2       False
3       False
4       False
        ...  
1920    False
1921    False
1922    False
1923    False
1924    False
Length: 1925, dtype: bool

The .drop_duplicates method removes duplicate rows and .duplicated flags them; both show that our dataset contains no duplicate rows.
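
As a minimal sketch of the semantics (toy data, not the ICU dataset): duplicated() marks later copies of a row, and drop_duplicates(keep="first") keeps only the first occurrence.

```python
import pandas as pd

# Toy frame with one exact duplicate row.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated(keep="first") flags the second copy only.
mask = df.duplicated(keep="first")

# drop_duplicates removes the flagged rows.
deduped = df.drop_duplicates(keep="first")
print(mask.tolist(), deduped.shape)
```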

In [17]:
data_icu = data_icu.drop_duplicates(keep='first')
In [18]:
# Drop columns in which every value is null
data = data_icu.dropna(axis=1, how="all")
data.shape
Out[18]:
(1925, 231)

The .dropna(axis=1, how="all") call drops any column in which every value is null. Since the shape is unchanged, no column in our dataset is entirely null at this point.
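
The how= argument controls the threshold for dropping: "all" drops a column only when every value is missing, while "any" drops it at the first missing value. A toy sketch (hypothetical values):

```python
import pandas as pd
import numpy as np

# Toy frame: one column entirely NaN, one partially NaN, one complete.
df = pd.DataFrame({
    "all_nan": [np.nan, np.nan],
    "some_nan": [1.0, np.nan],
    "full": [1.0, 2.0],
})

# how="all" drops only the entirely-missing column;
# how="any" also drops the partially missing one.
kept_all = df.dropna(axis=1, how="all")
kept_any = df.dropna(axis=1, how="any")
print(list(kept_all.columns), list(kept_any.columns))
```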

In [19]:
data_icu.isnull().sum()
Out[19]:
PATIENT_VISIT_IDENTIFIER        0
AGE_ABOVE65                     0
AGE_PERCENTIL                   0
GENDER                          0
DISEASE GROUPING 1              5
                             ... 
RESPIRATORY_RATE_DIFF_REL     748
TEMPERATURE_DIFF_REL          694
OXYGEN_SATURATION_DIFF_REL    686
WINDOW                          0
ICU                             0
Length: 231, dtype: int64

.isnull() flags the null values in the dataset, and chaining .sum() tabulates the null count per column, as shown above. Many of the variables contain null values.

Missing Values¶

In [20]:
def _impute_missing_data(data_icu):
    # The dataset encodes missing measurements as -1;
    # mark them as NaN so pandas treats them as missing.
    return data_icu.replace(-1, np.nan)

data_icu = _impute_missing_data(data_icu)
In [21]:
print('NaN values = ', data_icu.isnull().sum().sum())
print()

vars_with_missing = []

for feature in data_icu.columns:
    missings = data_icu[feature].isna().sum()

    if missings > 0:
        vars_with_missing.append(feature)
        missings_perc = missings / data_icu.shape[0]

        print('Variable {} has{} records ({:.2%}) with missing values.'.format(feature, missings, missings_perc))
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
NaN values =  269697

Variable DISEASE GROUPING 1 has5 records (0.26%) with missing values.
Variable DISEASE GROUPING 2 has5 records (0.26%) with missing values.
Variable DISEASE GROUPING 3 has5 records (0.26%) with missing values.
Variable DISEASE GROUPING 4 has5 records (0.26%) with missing values.
Variable DISEASE GROUPING 5 has5 records (0.26%) with missing values.
Variable DISEASE GROUPING 6 has5 records (0.26%) with missing values.
Variable HTN has5 records (0.26%) with missing values.
Variable IMMUNOCOMPROMISED has5 records (0.26%) with missing values.
Variable OTHER has5 records (0.26%) with missing values.
Variable ALBUMIN_MEDIAN has1105 records (57.40%) with missing values.
Variable ALBUMIN_MEAN has1105 records (57.40%) with missing values.
Variable ALBUMIN_MIN has1105 records (57.40%) with missing values.
Variable ALBUMIN_MAX has1105 records (57.40%) with missing values.
Variable ALBUMIN_DIFF has1925 records (100.00%) with missing values.
Variable BE_ARTERIAL_MEDIAN has1850 records (96.10%) with missing values.
Variable BE_ARTERIAL_MEAN has1850 records (96.10%) with missing values.
Variable BE_ARTERIAL_MIN has1850 records (96.10%) with missing values.
Variable BE_ARTERIAL_MAX has1850 records (96.10%) with missing values.
Variable BE_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable BE_VENOUS_MEDIAN has1687 records (87.64%) with missing values.
Variable BE_VENOUS_MEAN has1687 records (87.64%) with missing values.
Variable BE_VENOUS_MIN has1687 records (87.64%) with missing values.
Variable BE_VENOUS_MAX has1687 records (87.64%) with missing values.
Variable BE_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable BIC_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values.
Variable BIC_ARTERIAL_MEAN has1105 records (57.40%) with missing values.
Variable BIC_ARTERIAL_MIN has1105 records (57.40%) with missing values.
Variable BIC_ARTERIAL_MAX has1105 records (57.40%) with missing values.
Variable BIC_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable BIC_VENOUS_MEDIAN has1105 records (57.40%) with missing values.
Variable BIC_VENOUS_MEAN has1105 records (57.40%) with missing values.
Variable BIC_VENOUS_MIN has1105 records (57.40%) with missing values.
Variable BIC_VENOUS_MAX has1105 records (57.40%) with missing values.
Variable BIC_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable BILLIRUBIN_MEDIAN has1105 records (57.40%) with missing values.
Variable BILLIRUBIN_MEAN has1105 records (57.40%) with missing values.
Variable BILLIRUBIN_MIN has1105 records (57.40%) with missing values.
Variable BILLIRUBIN_MAX has1105 records (57.40%) with missing values.
Variable BILLIRUBIN_DIFF has1925 records (100.00%) with missing values.
Variable BLAST_MEDIAN has1918 records (99.64%) with missing values.
Variable BLAST_MEAN has1918 records (99.64%) with missing values.
Variable BLAST_MIN has1918 records (99.64%) with missing values.
Variable BLAST_MAX has1918 records (99.64%) with missing values.
Variable BLAST_DIFF has1925 records (100.00%) with missing values.
Variable CALCIUM_MEDIAN has1105 records (57.40%) with missing values.
Variable CALCIUM_MEAN has1105 records (57.40%) with missing values.
Variable CALCIUM_MIN has1105 records (57.40%) with missing values.
Variable CALCIUM_MAX has1105 records (57.40%) with missing values.
Variable CALCIUM_DIFF has1925 records (100.00%) with missing values.
Variable CREATININ_MEDIAN has1105 records (57.40%) with missing values.
Variable CREATININ_MEAN has1105 records (57.40%) with missing values.
Variable CREATININ_MIN has1105 records (57.40%) with missing values.
Variable CREATININ_MAX has1105 records (57.40%) with missing values.
Variable CREATININ_DIFF has1925 records (100.00%) with missing values.
Variable FFA_MEDIAN has1105 records (57.40%) with missing values.
Variable FFA_MEAN has1105 records (57.40%) with missing values.
Variable FFA_MIN has1105 records (57.40%) with missing values.
Variable FFA_MAX has1105 records (57.40%) with missing values.
Variable FFA_DIFF has1925 records (100.00%) with missing values.
Variable GGT_MEDIAN has1105 records (57.40%) with missing values.
Variable GGT_MEAN has1105 records (57.40%) with missing values.
Variable GGT_MIN has1105 records (57.40%) with missing values.
Variable GGT_MAX has1105 records (57.40%) with missing values.
Variable GGT_DIFF has1925 records (100.00%) with missing values.
Variable GLUCOSE_MEDIAN has1105 records (57.40%) with missing values.
Variable GLUCOSE_MEAN has1105 records (57.40%) with missing values.
Variable GLUCOSE_MIN has1105 records (57.40%) with missing values.
Variable GLUCOSE_MAX has1105 records (57.40%) with missing values.
Variable GLUCOSE_DIFF has1925 records (100.00%) with missing values.
Variable HEMATOCRITE_MEDIAN has1105 records (57.40%) with missing values.
Variable HEMATOCRITE_MEAN has1105 records (57.40%) with missing values.
Variable HEMATOCRITE_MIN has1105 records (57.40%) with missing values.
Variable HEMATOCRITE_MAX has1105 records (57.40%) with missing values.
Variable HEMATOCRITE_DIFF has1925 records (100.00%) with missing values.
Variable HEMOGLOBIN_MEDIAN has1105 records (57.40%) with missing values.
Variable HEMOGLOBIN_MEAN has1105 records (57.40%) with missing values.
Variable HEMOGLOBIN_MIN has1105 records (57.40%) with missing values.
Variable HEMOGLOBIN_MAX has1105 records (57.40%) with missing values.
Variable HEMOGLOBIN_DIFF has1925 records (100.00%) with missing values.
Variable INR_MEDIAN has1105 records (57.40%) with missing values.
Variable INR_MEAN has1105 records (57.40%) with missing values.
Variable INR_MIN has1105 records (57.40%) with missing values.
Variable INR_MAX has1105 records (57.40%) with missing values.
Variable INR_DIFF has1925 records (100.00%) with missing values.
Variable LACTATE_MEDIAN has1105 records (57.40%) with missing values.
Variable LACTATE_MEAN has1105 records (57.40%) with missing values.
Variable LACTATE_MIN has1105 records (57.40%) with missing values.
Variable LACTATE_MAX has1105 records (57.40%) with missing values.
Variable LACTATE_DIFF has1925 records (100.00%) with missing values.
Variable LEUKOCYTES_MEDIAN has1105 records (57.40%) with missing values.
Variable LEUKOCYTES_MEAN has1105 records (57.40%) with missing values.
Variable LEUKOCYTES_MIN has1105 records (57.40%) with missing values.
Variable LEUKOCYTES_MAX has1105 records (57.40%) with missing values.
Variable LEUKOCYTES_DIFF has1925 records (100.00%) with missing values.
Variable LINFOCITOS_MEDIAN has1105 records (57.40%) with missing values.
Variable LINFOCITOS_MEAN has1105 records (57.40%) with missing values.
Variable LINFOCITOS_MIN has1105 records (57.40%) with missing values.
Variable LINFOCITOS_MAX has1105 records (57.40%) with missing values.
Variable LINFOCITOS_DIFF has1925 records (100.00%) with missing values.
Variable NEUTROPHILES_MEDIAN has1105 records (57.40%) with missing values.
Variable NEUTROPHILES_MEAN has1105 records (57.40%) with missing values.
Variable NEUTROPHILES_MIN has1105 records (57.40%) with missing values.
Variable NEUTROPHILES_MAX has1105 records (57.40%) with missing values.
Variable NEUTROPHILES_DIFF has1925 records (100.00%) with missing values.
Variable P02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values.
Variable P02_ARTERIAL_MEAN has1105 records (57.40%) with missing values.
Variable P02_ARTERIAL_MIN has1105 records (57.40%) with missing values.
Variable P02_ARTERIAL_MAX has1105 records (57.40%) with missing values.
Variable P02_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable P02_VENOUS_MEDIAN has1105 records (57.40%) with missing values.
Variable P02_VENOUS_MEAN has1105 records (57.40%) with missing values.
Variable P02_VENOUS_MIN has1105 records (57.40%) with missing values.
Variable P02_VENOUS_MAX has1105 records (57.40%) with missing values.
Variable P02_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable PC02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values.
Variable PC02_ARTERIAL_MEAN has1105 records (57.40%) with missing values.
Variable PC02_ARTERIAL_MIN has1105 records (57.40%) with missing values.
Variable PC02_ARTERIAL_MAX has1105 records (57.40%) with missing values.
Variable PC02_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable PC02_VENOUS_MEDIAN has1106 records (57.45%) with missing values.
Variable PC02_VENOUS_MEAN has1106 records (57.45%) with missing values.
Variable PC02_VENOUS_MIN has1106 records (57.45%) with missing values.
Variable PC02_VENOUS_MAX has1106 records (57.45%) with missing values.
Variable PC02_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable PCR_MEDIAN has1113 records (57.82%) with missing values.
Variable PCR_MEAN has1113 records (57.82%) with missing values.
Variable PCR_MIN has1113 records (57.82%) with missing values.
Variable PCR_MAX has1113 records (57.82%) with missing values.
Variable PCR_DIFF has1925 records (100.00%) with missing values.
Variable PH_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values.
Variable PH_ARTERIAL_MEAN has1105 records (57.40%) with missing values.
Variable PH_ARTERIAL_MIN has1105 records (57.40%) with missing values.
Variable PH_ARTERIAL_MAX has1105 records (57.40%) with missing values.
Variable PH_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable PH_VENOUS_MEDIAN has1105 records (57.40%) with missing values.
Variable PH_VENOUS_MEAN has1105 records (57.40%) with missing values.
Variable PH_VENOUS_MIN has1105 records (57.40%) with missing values.
Variable PH_VENOUS_MAX has1105 records (57.40%) with missing values.
Variable PH_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable PLATELETS_MEDIAN has1105 records (57.40%) with missing values.
Variable PLATELETS_MEAN has1105 records (57.40%) with missing values.
Variable PLATELETS_MIN has1105 records (57.40%) with missing values.
Variable PLATELETS_MAX has1105 records (57.40%) with missing values.
Variable PLATELETS_DIFF has1925 records (100.00%) with missing values.
Variable POTASSIUM_MEDIAN has1106 records (57.45%) with missing values.
Variable POTASSIUM_MEAN has1106 records (57.45%) with missing values.
Variable POTASSIUM_MIN has1106 records (57.45%) with missing values.
Variable POTASSIUM_MAX has1106 records (57.45%) with missing values.
Variable POTASSIUM_DIFF has1925 records (100.00%) with missing values.
Variable SAT02_ARTERIAL_MEDIAN has1105 records (57.40%) with missing values.
Variable SAT02_ARTERIAL_MEAN has1105 records (57.40%) with missing values.
Variable SAT02_ARTERIAL_MIN has1105 records (57.40%) with missing values.
Variable SAT02_ARTERIAL_MAX has1105 records (57.40%) with missing values.
Variable SAT02_ARTERIAL_DIFF has1925 records (100.00%) with missing values.
Variable SAT02_VENOUS_MEDIAN has1105 records (57.40%) with missing values.
Variable SAT02_VENOUS_MEAN has1105 records (57.40%) with missing values.
Variable SAT02_VENOUS_MIN has1105 records (57.40%) with missing values.
Variable SAT02_VENOUS_MAX has1105 records (57.40%) with missing values.
Variable SAT02_VENOUS_DIFF has1925 records (100.00%) with missing values.
Variable SODIUM_MEDIAN has1105 records (57.40%) with missing values.
Variable SODIUM_MEAN has1105 records (57.40%) with missing values.
Variable SODIUM_MIN has1105 records (57.40%) with missing values.
Variable SODIUM_MAX has1105 records (57.40%) with missing values.
Variable SODIUM_DIFF has1925 records (100.00%) with missing values.
Variable TGO_MEDIAN has1106 records (57.45%) with missing values.
Variable TGO_MEAN has1106 records (57.45%) with missing values.
Variable TGO_MIN has1106 records (57.45%) with missing values.
Variable TGO_MAX has1106 records (57.45%) with missing values.
Variable TGO_DIFF has1925 records (100.00%) with missing values.
Variable TGP_MEDIAN has1105 records (57.40%) with missing values.
Variable TGP_MEAN has1105 records (57.40%) with missing values.
Variable TGP_MIN has1105 records (57.40%) with missing values.
Variable TGP_MAX has1105 records (57.40%) with missing values.
Variable TGP_DIFF has1925 records (100.00%) with missing values.
Variable TTPA_MEDIAN has1105 records (57.40%) with missing values.
Variable TTPA_MEAN has1105 records (57.40%) with missing values.
Variable TTPA_MIN has1105 records (57.40%) with missing values.
Variable TTPA_MAX has1105 records (57.40%) with missing values.
Variable TTPA_DIFF has1925 records (100.00%) with missing values.
Variable UREA_MEDIAN has1105 records (57.40%) with missing values.
Variable UREA_MEAN has1105 records (57.40%) with missing values.
Variable UREA_MIN has1105 records (57.40%) with missing values.
Variable UREA_MAX has1105 records (57.40%) with missing values.
Variable UREA_DIFF has1925 records (100.00%) with missing values.
Variable DIMER_MEDIAN has1138 records (59.12%) with missing values.
Variable DIMER_MEAN has1138 records (59.12%) with missing values.
Variable DIMER_MIN has1138 records (59.12%) with missing values.
Variable DIMER_MAX has1138 records (59.12%) with missing values.
Variable DIMER_DIFF has1925 records (100.00%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEAN has686 records (35.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEAN has687 records (35.69%) with missing values.
Variable HEART_RATE_MEAN has686 records (35.64%) with missing values.
Variable RESPIRATORY_RATE_MEAN has749 records (38.91%) with missing values.
Variable TEMPERATURE_MEAN has695 records (36.10%) with missing values.
Variable OXYGEN_SATURATION_MEAN has687 records (35.69%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MEDIAN has686 records (35.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MEDIAN has689 records (35.79%) with missing values.
Variable HEART_RATE_MEDIAN has686 records (35.64%) with missing values.
Variable RESPIRATORY_RATE_MEDIAN has749 records (38.91%) with missing values.
Variable TEMPERATURE_MEDIAN has695 records (36.10%) with missing values.
Variable OXYGEN_SATURATION_MEDIAN has687 records (35.69%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MIN has686 records (35.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MIN has689 records (35.79%) with missing values.
Variable HEART_RATE_MIN has687 records (35.69%) with missing values.
Variable RESPIRATORY_RATE_MIN has789 records (40.99%) with missing values.
Variable TEMPERATURE_MIN has695 records (36.10%) with missing values.
Variable OXYGEN_SATURATION_MIN has688 records (35.74%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_MAX has686 records (35.64%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_MAX has687 records (35.69%) with missing values.
Variable HEART_RATE_MAX has686 records (35.64%) with missing values.
Variable RESPIRATORY_RATE_MAX has749 records (38.91%) with missing values.
Variable TEMPERATURE_MAX has695 records (36.10%) with missing values.
Variable OXYGEN_SATURATION_MAX has687 records (35.69%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF has1309 records (68.00%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF has1300 records (67.53%) with missing values.
Variable HEART_RATE_DIFF has1300 records (67.53%) with missing values.
Variable RESPIRATORY_RATE_DIFF has1358 records (70.55%) with missing values.
Variable TEMPERATURE_DIFF has1284 records (66.70%) with missing values.
Variable OXYGEN_SATURATION_DIFF has1294 records (67.22%) with missing values.
Variable BLOODPRESSURE_DIASTOLIC_DIFF_REL has1309 records (68.00%) with missing values.
Variable BLOODPRESSURE_SISTOLIC_DIFF_REL has1300 records (67.53%) with missing values.
Variable HEART_RATE_DIFF_REL has1300 records (67.53%) with missing values.
Variable RESPIRATORY_RATE_DIFF_REL has1358 records (70.55%) with missing values.
Variable TEMPERATURE_DIFF_REL has1284 records (66.70%) with missing values.
Variable OXYGEN_SATURATION_DIFF_REL has1294 records (67.22%) with missing values.
In total, there are 225 variables with missing values

The loop above reports, for each variable, how many records are missing and the corresponding percentage, after the -1 placeholder values were replaced with NaN.
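
As a hedged alternative sketch (toy data, hypothetical values): the same per-variable report can be computed vectorized with a single isna().sum() and a boolean filter, rather than a Python loop.

```python
import pandas as pd
import numpy as np

# Toy frame: 'b' fully missing, 'a' partially, 'c' complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# Count missing values per column, convert to percentages,
# and keep only the columns that actually have gaps.
missing = df.isna().sum()
missing_pct = (missing / len(df)).mul(100).round(2)
report = missing_pct[missing > 0].sort_values(ascending=False)
print(report)
```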

Missing Values by Percentage¶

In [22]:
pd.DataFrame({"Columns": data_icu.columns,"Missing_values":((data.isna()).sum()/data_icu.shape[0])*100})
Out[22]:
Columns Missing_values
PATIENT_VISIT_IDENTIFIER PATIENT_VISIT_IDENTIFIER 0.000000
AGE_ABOVE65 AGE_ABOVE65 0.000000
AGE_PERCENTIL AGE_PERCENTIL 0.000000
GENDER GENDER 0.000000
DISEASE GROUPING 1 DISEASE GROUPING 1 0.259740
... ... ...
RESPIRATORY_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL 38.857143
TEMPERATURE_DIFF_REL TEMPERATURE_DIFF_REL 36.051948
OXYGEN_SATURATION_DIFF_REL OXYGEN_SATURATION_DIFF_REL 35.636364
WINDOW WINDOW 0.000000
ICU ICU 0.000000

231 rows × 2 columns

The table above shows the percentage of missing values for each column of our dataset.

Barplot¶

In [23]:
import missingno as msno
msno.bar(data_icu)
Out[23]:
<AxesSubplot:>

We import the missingno library to visualize missingness; the bar chart shows the count of non-null values per column.

Heatmap¶

In [24]:
msno.heatmap(data_icu)
Out[24]:
<AxesSubplot:>

The missingno heatmap is a two-dimensional representation of nullity correlation: each cell's hue shows how strongly the missingness of one variable is related to the missingness of another.
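
For intuition, the matrix behind that plot is roughly the correlation of the nullity indicators. A sketch on toy data (hypothetical values; note msno.heatmap additionally drops columns that are never or always missing):

```python
import pandas as pd
import numpy as np

# Toy frame: 'a' and 'b' are missing on exactly the same rows,
# so their nullity correlation is 1.0.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [5.0, np.nan, 7.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Correlate the boolean missingness indicators.
nullity_corr = df.isna().corr()
print(nullity_corr)
```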

In [25]:
correlation = data_icu.corr()
sns.heatmap(correlation, xticklabels=correlation.columns, yticklabels=correlation.columns, annot=True)
Out[25]:
<AxesSubplot:>

Because of the large number of columns and the many NaN values, the full annotated heatmap is unreadable, and no correlations between columns can be made out at this scale.
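
One readable alternative, sketched here on synthetic toy data (the column names "related" and "noise" are hypothetical, not from the dataset), is to rank features by absolute correlation with the ICU target instead of plotting the full matrix:

```python
import pandas as pd
import numpy as np

# Synthetic data: one feature built from the target, one pure noise.
rng = np.random.default_rng(0)
n = 200
icu = rng.integers(0, 2, n)
df = pd.DataFrame({
    "ICU": icu,
    "related": icu + rng.normal(0, 0.3, n),  # correlated with the target
    "noise": rng.normal(0, 1.0, n),          # unrelated
})

# Rank features by |correlation| with the target column.
top = df.corr()["ICU"].drop("ICU").abs().sort_values(ascending=False)
print(top)
```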

Pivot¶

In [26]:
pd.pivot_table(data_icu, index=['ICU', 'GENDER'], columns = ['AGE_ABOVE65'], aggfunc=len)
Out[26]:
AGE_PERCENTIL ALBUMIN_DIFF ALBUMIN_MAX ALBUMIN_MEAN ALBUMIN_MEDIAN ... UREA_MAX UREA_MEAN UREA_MEDIAN UREA_MIN WINDOW
AGE_ABOVE65 0 1 0 1 0 1 0 1 0 1 ... 0 1 0 1 0 1 0 1 0 1
ICU GENDER
0 0 527 336 527 336 527 336 527 336 527 336 ... 527 336 527 336 527 336 527 336 527 336
1 314 233 314 233 314 233 314 233 314 233 ... 314 233 314 233 314 233 314 233 314 233
1 0 143 209 143 209 143 209 143 209 143 209 ... 143 209 143 209 143 209 143 209 143 209
1 41 122 41 122 41 122 41 122 41 122 ... 41 122 41 122 41 122 41 122 41 122

4 rows × 456 columns

  • In ICU

0 - Patient not admitted to the ICU
1 - Patient admitted to the ICU

0 - The patient is not critical.
1 - The patient is in critical condition due to Covid-19.

  • In GENDER
    0 - Male
    1 - Female

Male rows outnumber female rows in both groups: 863 vs. 547 outside the ICU, and 352 vs. 163 inside the ICU.
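
The group counts read off the pivot table can be reproduced more compactly with a crosstab. A hedged sketch on toy data (the flag values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy ICU/GENDER flags (0 = not ICU / male, 1 = ICU / female).
df = pd.DataFrame({
    "ICU":    [0, 0, 0, 1, 1, 0],
    "GENDER": [0, 1, 0, 0, 1, 1],
})

# Row counts per (ICU, GENDER) combination.
counts = pd.crosstab(df["ICU"], df["GENDER"])
print(counts)
```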

UNIVARIATE EXPLORATION¶

The word "Uni" means "one," therefore "univariate analysis" refers to the study of only one variable at a time.

For univariate observation we plot each variable's data distribution together with its kernel-density estimate.

In [27]:
# histplot replaces the deprecated distplot
sns.histplot(data_icu['TEMPERATURE_DIFF'], kde=True, stat="density")
plt.show()

The temperature-difference density rises continuously to a peak of about 1.25 and then declines between -0.5 and 0.

In [28]:
sns.histplot(data_icu['OXYGEN_SATURATION_DIFF'], kde=True, stat="density")
plt.show()

The oxygen-saturation difference density lies between -0.1 and 0.5 and peaks at about 2.75.

In [29]:
sns.histplot(data_icu['BLOODPRESSURE_DIASTOLIC_DIFF_REL'], kde=True, stat="density")
plt.show()

The density of BLOODPRESSURE_DIASTOLIC_DIFF_REL peaks at about 1.3.

In [30]:
sns.histplot(data_icu['BLOODPRESSURE_SISTOLIC_DIFF_REL'], kde=True, stat="density")
plt.show()

Here the density peaks at about 1.2.

In [31]:
sns.histplot(data_icu['HEART_RATE_DIFF_REL'], kde=True, stat="density")
plt.show()

The maximum density for HEART_RATE_DIFF_REL is about 1.5.

In [32]:
sns.histplot(data_icu['RESPIRATORY_RATE_DIFF_REL'], kde=True, stat="density")
plt.show()

The maximum density for RESPIRATORY_RATE_DIFF_REL is 1.2.

In [33]:
sns.distplot(data_icu['TEMPERATURE_DIFF_REL'])
plt.show()

The maximum density for TEMPERATURE_DIFF_REL is 1.25.

In [34]:
sns.distplot(data_icu['OXYGEN_SATURATION_DIFF_REL'])
plt.show()

The maximum density for OXYGEN_SATURATION_DIFF_REL is 2.8.

In [35]:
sns.distplot(data_icu['ICU'])
plt.show()

The maximum density for ICU is 3.

In [36]:
sns.distplot(data_icu['PATIENT_VISIT_IDENTIFIER'])
plt.show()
In [37]:
a=data_icu['ICU'].value_counts()
plt.pie(a,labels = ['NON-ICU', 'ICU'])
plt.show()

According to the pie chart, more than half of the patients do not require ICU beds.

BIVARIATE EXPLORATION¶

Analyzing two variables simultaneously is known as bivariate analysis.

In [38]:
sns.boxplot(x="GENDER" , y="AGE_ABOVE65",data = data_icu)
Out[38]:
<AxesSubplot:xlabel='GENDER', ylabel='AGE_ABOVE65'>

Men and women over the age of 65 appear to be affected by COVID-19 in roughly equal measure.

In [39]:
sns.boxplot(x="ICU", y="AGE_ABOVE65" , data = data_icu)
Out[39]:
<AxesSubplot:xlabel='ICU', ylabel='AGE_ABOVE65'>

The age distribution (above 65) looks similar for ICU and non-ICU patients.

In [40]:
age = sns.countplot(x='AGE_ABOVE65' , hue='GENDER' , data=data_icu)
for p in age.patches:
    height = p.get_height()
    age.text(p.get_x() + p.get_width()/2. , height + 0.1, height, ha="center")
In [41]:
icu=sns.countplot(x="ICU", hue = "GENDER", data = data_icu)
for p in icu.patches:
    height = p.get_height()
    icu.text(p.get_x() + p.get_width()/2., height + 0.1,height, ha="center")

Compared with female patients, male COVID-19 patients were admitted in greater numbers.

Data Preparation¶

DATA CLEANING

It is necessary to clean our data before applying ML algorithms.

In [42]:
drop_cols = ['TEMPERATURE_DIFF', 'OXYGEN_SATURATION_DIFF', 'BLOODPRESSURE_DIASTOLIC_DIFF_REL', 'BLOODPRESSURE_SISTOLIC_DIFF_REL']
In [43]:
data_icu
Out[43]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW ICU
0 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 2 0
1 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 4 0
2 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 6 0
3 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 12 0
4 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 13 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1920 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 2 0
1921 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 4 0
1922 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 6 0
1923 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 12 0
1924 384 0 50 1 0.0 0.0 1.0 0.0 0.0 0.0 ... -0.547619 -0.838384 -0.701863 -0.585967 -0.763868 -0.612903 -0.551337 -0.835052 13 0

1925 rows × 231 columns

In [44]:
dataset = data_icu.copy()
In [45]:
data = data_icu.fillna(0)
In [46]:
dataset = data.fillna(0)

The fillna() function substitutes a given value for any NULL values. We replaced Null values with 0.
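Filling every NaN with 0 is simple, but it pulls each distribution toward zero. As a hedged alternative (not what the notebook does), missing numeric values can be imputed column-wise with the median; a minimal sketch on a toy frame whose column name is illustrative only:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_icu; the column name is illustrative only.
df = pd.DataFrame({"HEART_RATE_DIFF": [0.2, np.nan, -0.4, np.nan],
                   "ICU": [0, 1, 0, 1]})

# Constant fill, as in the notebook:
filled_zero = df.fillna(0)

# Median fill per column, an alternative worth comparing:
filled_median = df.fillna(df.median(numeric_only=True))

print(filled_zero["HEART_RATE_DIFF"].tolist())    # [0.2, 0.0, -0.4, 0.0]
print(filled_median["HEART_RATE_DIFF"].tolist())  # [0.2, -0.1, -0.4, -0.1]
```

Which strategy is better depends on the column; for the *_DIFF features here, where -1 and 0 carry meaning, the choice deserves a sanity check against the distributions plotted above.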

In [47]:
import missingno as msno
msno.bar(data)
Out[47]:
<AxesSubplot:>

We import the missingno library to visualize our data: the bar chart shows missing values per column, confirming that none remain after filling NaNs with 0.

Identified Outliers¶

An outlier is a data point in a data set that is distant from all other observations. A data point that lies outside the overall distribution of the dataset.
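One common way to make this concrete is the IQR rule: flag points more than 1.5×IQR beyond the quartiles. A minimal sketch on synthetic values (not the actual dataset), with two extremes injected on purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 well-behaved samples plus two injected extremes.
values = np.concatenate([rng.normal(0, 1, 100), [8.0, -9.0]])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the Tukey fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # the injected extremes should be among those flagged
```

The box plots below apply the same fence logic visually, one variable per panel.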

In [48]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=2, cols=4)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_SISTOLIC_MAX'],name='BLOODPRESSURE_SISTOLIC_MAX'),row=1,col=1)
fig.add_trace(go.Box(y=data_icu['HEART_RATE_MAX'],name='HEART_RATE_MAX'),row=1,col=2)
fig.add_trace(go.Box(y=data_icu['RESPIRATORY_RATE_MAX'],name='RESPIRATORY_RATE_MAX'),row=1,col=3)
fig.add_trace(go.Box(y=data_icu['TEMPERATURE_MAX'],name='TEMPERATURE_MAX'),row=1,col=4)
fig.add_trace(go.Box(y=data_icu['OXYGEN_SATURATION_MAX'],name='OXYGEN_SATURATION_MAX'),row=2,col=1)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_DIASTOLIC_DIFF'],name='BLOODPRESSURE_DIASTOLIC_DIFF'),row=2,col=2)
fig.add_trace(go.Box(y=data_icu['BLOODPRESSURE_SISTOLIC_DIFF'],name='BLODDPRESSURE_SISTOLIC_DIFF'),row=2,col=3)
fig.add_trace(go.Box(y=data_icu['HEART_RATE_DIFF'],name='HEART_RATE_DIFF'),row=2,col=4)
fig.show()
In [49]:
data_icu.BLOODPRESSURE_SISTOLIC_MAX.mean()
data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
data_icu.BLOODPRESSURE_SISTOLIC_MAX.describe()
Out[49]:
count    1238.000000
mean       -0.398612
std         0.286796
min        -0.989189
25%        -0.578378
50%        -0.459459
75%        -0.243243
max         1.000000
Name: BLOODPRESSURE_SISTOLIC_MAX, dtype: float64
In [50]:
upper_limit = data_icu.BLOODPRESSURE_SISTOLIC_MAX.mean() + 3*data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
upper_limit
Out[50]:
0.461776

According to our result, the upper limit (mean + 3·std) is approximately 0.46.

In [51]:
lower_limit = data_icu.BLOODPRESSURE_SISTOLIC_MAX.mean() - 3*data_icu.BLOODPRESSURE_SISTOLIC_MAX.std()
lower_limit
Out[51]:
-1.2589994388123775

According to our result, the lower limit is approximately -1.26.
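With both limits in hand, rows outside the mean ± 3σ band could be dropped or clipped. A hedged sketch on a toy series (the real values would come from data_icu):

```python
import pandas as pd

# Toy stand-in: thirty in-band readings plus one extreme value.
s = pd.Series([0.0] * 30 + [10.0])

upper = s.mean() + 3 * s.std()
lower = s.mean() - 3 * s.std()

in_band = s[(s >= lower) & (s <= upper)]  # drops the extreme reading
clipped = s.clip(lower, upper)            # caps it at the band edge instead
print(len(s), len(in_band))  # 31 30
```

Clipping keeps the row count intact, which matters here because each patient contributes several time windows.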

ML Algorithms¶

In [52]:
x=dataset.drop('ICU', axis=1)
x
Out[52]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW
0 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2
1 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4
2 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6
3 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12
4 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.176471 -0.238095 -0.818182 -0.389967 0.407558 -0.230462 0.096774 -0.242282 -0.814433 13
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1920 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2
1921 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4
1922 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6
1923 384 0 50 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12
1924 384 0 50 1 0.0 0.0 1.0 0.0 0.0 0.0 ... -0.647059 -0.547619 -0.838384 -0.701863 -0.585967 -0.763868 -0.612903 -0.551337 -0.835052 13

1925 rows × 230 columns

In [53]:
y=dataset['ICU']
y
Out[53]:
0       0
1       0
2       0
3       0
4       1
       ..
1920    0
1921    0
1922    0
1923    0
1924    0
Name: ICU, Length: 1925, dtype: int64

Split the data into Two part¶

We split the data so the model can be trained on one part and evaluated on the other.

In [54]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=95)
In [55]:
x_train
Out[55]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW
601 120 1 80 0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 4
1836 367 1 90 1 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 4
1821 364 1 90 0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 4
280 56 0 50 1 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 2
547 109 0 30 0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
773 154 0 40 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 12
118 23 0 40 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 -0.761905 0.0 0.0 0.0 0.0 0.0 -0.767053 0.0 12
1555 311 1 60 0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 2
1321 264 1 60 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 4
1430 286 0 40 0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 2

1540 rows × 230 columns

In [56]:
y_train
Out[56]:
601     0
1836    0
1821    1
280     0
547     1
       ..
773     0
118     0
1555    0
1321    0
1430    1
Name: ICU, Length: 1540, dtype: int64
In [57]:
x_test
Out[57]:
PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 AGE_PERCENTIL GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 ... RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL OXYGEN_SATURATION_DIFF_REL WINDOW
1647 329 1 90 0 1.0 1.0 0.0 0.0 1.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 6
1607 321 1 60 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 6
1421 284 0 20 0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 4
1 0 1 60 0 0.0 0.0 0.0 0.0 1.0 1.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 4
1220 244 1 60 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1563 312 0 20 0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 12
1102 220 1 100 0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 6
1680 336 0 10 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 2
442 88 0 10 1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 6
1096 219 1 100 1 1.0 0.0 0.0 0.0 1.0 0.0 ... -0.529412 -0.928571 -0.979798 -0.97756 -0.886472 -0.989267 -0.697442 -0.929831 -0.979601 4

385 rows × 230 columns

In [58]:
y_test
Out[58]:
1647    1
1607    0
1421    0
1       0
1220    0
       ..
1563    0
1102    0
1680    0
442     0
1096    1
Name: ICU, Length: 385, dtype: int64
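Roughly a quarter of the windows are ICU-positive, so a stratified split would keep that class ratio identical in both halves; a hedged sketch of the same call with `stratify` added, on stand-in labels rather than the real column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with roughly the same imbalance as the ICU column.
y_demo = np.array([0] * 75 + [1] * 25)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=95, stratify=y_demo)

print(y_tr.mean(), y_te.mean())  # both print 0.25: the class ratio is preserved
```

Without `stratify`, a small test set can end up with a noticeably different positive rate, which distorts accuracy comparisons between models.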

Decision Tree¶

One of the strongest and most popular algorithms is the decision tree. The decision-tree algorithm is a type of supervised learning method that works with both categorical and continuous output variables.

In [59]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
dtc=DecisionTreeClassifier()
dtc.fit(x_train, y_train)
Out[59]:
DecisionTreeClassifier()
In [60]:
y_pred1=dtc.predict(x_test)
In [61]:
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_pred1,y_test)
Out[61]:
0.7688311688311689

The accuracy of this model is approximately 0.77.

In [62]:
confusion_matrix(y_pred1,y_test)
Out[62]:
array([[242,  47],
       [ 42,  54]], dtype=int64)
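The accuracy reported above can be read straight off this matrix: correct predictions sit on the diagonal. A quick arithmetic check:

```python
import numpy as np

# Confusion matrix from the cell above (rows: predicted, cols: actual).
cm = np.array([[242, 47],
               [42, 54]])

accuracy = np.trace(cm) / cm.sum()  # (242 + 54) / 385
print(round(accuracy, 4))  # 0.7688
```

The off-diagonal cells (47 missed ICU cases, 42 false alarms) matter more than raw accuracy in this setting, since missing an ICU-bound patient is the costlier error.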
In [63]:
plt.figure(figsize=(40,40)) # set plot size (denoted in inches)
plot_tree(dtc, filled=True, fontsize=10)
plt.show()

Random Forest¶

Random forest selects a random sample from the training set, creates a decision tree for it, and gets a prediction; it repeats this for the assigned number of trees, then votes across the predictions, taking the majority class (for classification) or the average (for regression).

In [64]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()  # a classifier, despite the regression-style metrics below
rfc.fit(x_train, y_train)
Out[64]:
RandomForestClassifier()
In [65]:
y_pred = rfc.predict(x_test)
In [66]:
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:' , metrics.mean_squared_error(y_pred,y_test))
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print('R-Squared', r2_score(y_pred, y_test))
Mean Absolute Error: 0.14545454545454545
Mean Squared Error: 0.14545454545454545
Root Mean Squared Error: 0.3813850356982369
R-Squared 0.09090909090909127
In [67]:
from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_pred,y_test)
Out[67]:
0.8545454545454545

The accuracy for this model is approximately 0.85.

In [68]:
confusion_matrix (y_pred,y_test)
Out[68]:
array([[268,  40],
       [ 16,  61]], dtype=int64)
In [69]:
plt.figure(figsize=(8,6))
plt.plot(y_test,y_test,color='deeppink')
plt.scatter(y_test,y_pred,color='dodgerblue')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
plt.title('Random Forest Classifier',fontsize=14)
plt.show()

XG Boost¶

XGBoost is used for supervised learning problems, where we use the training data (with multiple features) to predict a target variable.

In [70]:
from xgboost import XGBClassifier
Classifier = XGBClassifier()
In [71]:
Classifier.fit(x_train, y_train)
Out[71]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...)
In [72]:
y_pred2=Classifier.predict(x_test)
In [73]:
accuracy_score(y_pred2,y_test)
Out[73]:
0.8831168831168831

The accuracy for this model is 0.88

In [74]:
confusion_matrix(y_pred2,y_test)
Out[74]:
array([[270,  31],
       [ 14,  70]], dtype=int64)

SVM¶

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used for classification problems.

In [75]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [76]:
svm = SVC()

param_grid = { 
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001]
}

cv_svm = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5)
cv_svm.fit(x_train, y_train.values.ravel())

print("Support Vector Machines Model best params: ", cv_svm.best_params_)

# Training model with best params
best_params = cv_svm.best_params_

svm_best = SVC(random_state = 42, 
              C = best_params['C'], 
              gamma = best_params['gamma'])

svm_best.fit(x_train, y_train)
y_pred = svm_best.predict(x_test)

# Evaluating the model
print("Accuracy for Support Vector Machines is : ", round(accuracy_score(y_test, y_pred), 2))

print("\n\nClassification report for Support Vector Machines:")
print(classification_report(y_test, y_pred))

print("\n\nConfusion matrix for Support Vector Machines:")
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True)
Support Vector Machines Model best params:  {'C': 100, 'gamma': 0.001}
Accuracy for Support Vector Machines is :  0.83


Classification report for Support Vector Machines:
              precision    recall  f1-score   support

           0       0.87      0.91      0.89       284
           1       0.71      0.61      0.66       101

    accuracy                           0.83       385
   macro avg       0.79      0.76      0.77       385
weighted avg       0.83      0.83      0.83       385



Confusion matrix for Support Vector Machines:
Out[76]:
<AxesSubplot:>

The accuracy for this model is 0.83.

Modeling and Model Evaluation¶

Ensemble Voting Method¶

In [77]:
clfs = {"SVM":SVC(kernel='rbf', probability=True),
       "DecisionTree":DecisionTreeClassifier(),
       "RandomForest":RandomForestClassifier(),
       "XGBoost":XGBClassifier(verbosity=0)}
In [78]:
import math
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
def model_fit(clfs):
    
    fitted_model = {}
    model_result = pd.DataFrame()
    for model_name, model in clfs.items():
        model.fit(x_train,y_train)
        fitted_model.update({model_name:model})
        y_pred =model.predict(x_test)
        model_dict = {}
        model_dict['1.Algorithm'] = model_name
        model_dict['2.Accuracy'] = round(accuracy_score(y_test, y_pred),3)
        model_dict['3.Precision'] = round(precision_score(y_test,y_pred),3)
        model_dict['4.Recall'] = round(recall_score(y_test,y_pred),3)
        model_dict['5.F1'] = round(f1_score(y_test, y_pred),3)
        model_dict['6.ROC'] = round(roc_auc_score(y_test, y_pred),3)
        model_result = pd.concat([model_result, pd.DataFrame([model_dict])], ignore_index=True)
    return fitted_model, model_result
     
In [79]:
fitted_model, model_result = model_fit(clfs)
model_result.sort_values(by=['2.Accuracy'],ascending=False)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.

C:\Users\rahul\AppData\Local\Temp\ipykernel_20408\3221118475.py:21: FutureWarning:

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

Out[79]:
1.Algorithm 2.Accuracy 3.Precision 4.Recall 5.F1 6.ROC
3 XGBoost 0.883 0.833 0.693 0.757 0.822
2 RandomForest 0.849 0.795 0.574 0.667 0.761
1 DecisionTree 0.764 0.553 0.515 0.533 0.683
0 SVM 0.738 0.000 0.000 0.000 0.500
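The SVM row above (precision 0.000, ROC 0.500) means the default SVC predicted only the majority class; one common cause is feeding an RBF kernel features on very different scales, and the usual fix is scaling inside a Pipeline. A hedged sketch on synthetic data rather than the notebook's split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, with one feature blown up to mimic unscaled inputs.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Scaling happens inside the pipeline, so the test set is never leaked into it.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_tr, y_tr)
print(scaled_svm.score(X_te, y_te))  # held-out accuracy
```

The earlier grid-searched SVM dodged the problem differently (a small gamma), but scaling is the more direct remedy and would make the ensemble comparison fairer.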
In [80]:
model_ordered = []
weights = []
i=1
for model_name in model_result['1.Algorithm'][
    index_natsorted(model_result['2.Accuracy'],reverse=False)]:
    model_ordered.append([model_name,clfs.get(model_name)])
    weights.append(math.exp(i))
    i+=0.8
In [81]:
plt.plot(weights)
plt.show()
In [82]:
weights
Out[82]:
[2.718281828459045, 6.0496474644129465, 13.463738035001692, 29.964100047397025]
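The exponential weights above are computed but never fed back into a model; one way to complete the ensemble is sklearn's VotingClassifier with soft voting. A minimal sketch on synthetic data, with GradientBoostingClassifier (already imported at the top of the notebook) standing in for XGBoost to stay within sklearn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Exponential weights as in the cells above, lowest-accuracy model first.
weights = [2.718, 6.050, 13.464, 29.964]

vote = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    voting="soft", weights=weights)
vote.fit(X, y)
print(vote.score(X, y))  # training accuracy of the weighted ensemble
```

Soft voting averages the weighted class probabilities, so the highest-accuracy model dominates without completely silencing the others.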

Conclusion¶

For this Brazilian COVID-19 dataset, XGBoost is the best-performing algorithm.

Executive Summary¶

Student ID: 21524331, Rahul Ladhani¶

Introduction: In this project, we use the Hospital Sírio-Libanês dataset to determine whether a COVID-19 patient will need an ICU bed. A machine learning strategy was adopted to maximize the usage of ICU beds, which were scarce during the COVID-19 crisis in Brazil. Based on the patient's current medical status, we use the dataset to classify whether the patient will require an ICU bed.

Basically, the work is divided into two parts:
1 EDA – data quality issues, univariate and bivariate exploration, data preparation (data cleaning).
2 ML algorithms – Decision Tree, Random Forest, XGBoost, and SVM, followed by modeling and model evaluation.

Exploratory Data Analysis (EDA): To recognize the inconsistencies and missing values in the dataset, we performed exploratory data analysis on it. Key observations: the dataset contains 1925 rows and 231 columns, with some missing values. Some columns were strings (such as AGE_PERCENTIL and WINDOW), so we converted them to numeric form. We plotted bar charts, heat maps, and pivot tables for better visualization. Univariate exploration – "uni" means "one", so univariate analysis studies one variable at a time; we used distplot to show each variable's distribution against its density. Bivariate exploration – analyzing two variables simultaneously; we plotted gender against age above 65, and ICU against age, to understand the distribution of ICU beds.

Data preparation: we cleaned the data for machine-learning modelling using the fillna method, which replaces NaN values with 0, then plotted a graph to confirm no NaN values remain.

Identified Outliers: An outlier is a data point that is distant from all other observations, lying outside the overall distribution of the dataset. We plotted box plots for several variables and checked the upper and lower limits for BLOODPRESSURE_SISTOLIC_MAX.

Machine learning (ML): To run the ML algorithms, we first split the data into x and y and then into train and test sets: 80% of the dataset was used for training and the remaining 20% for model assessment (testing).

The sklearn machine learning library has been used to generate machine learning classification models. The models we will design are:

  1. Decision Tree Classifier
  2. Random Forest
  3. XG Boost
  4. Support Vector Machines

1 . Decision Tree Classifier: the Decision Tree model is a bit less accurate, at about 0.77. We plotted the tree and show the confusion matrix as well.

2 . Random Forest: the Random Forest model, like the Decision Tree, performs differently on the two classes, but its overall accuracy is higher at about 0.85.

3 . XGBoost: the XGBoost model shows the highest accuracy of all the models, at about 0.88.

4 . Support Vector Machines: the SVM model predicts somewhat more accurately for the patients who do not require ICU beds; the overall accuracy is 0.83.

Modeling and Model Evaluation Ensemble Voting Method

We used an ensemble voting method to compare our models and present their metrics in a table and graph.

Conclusion: In this project, we used machine learning to determine whether a patient would need an ICU bed upon hospital admission, achieving a best accuracy of 88%. To improve the model's accuracy and predictions, future work could collect additional data and refine the models further.